Submit your .Rmd file on Moodle together with the .html file that is produced when you click on the “Knit” button. If you are having issues knitting your file, do get in touch (and knit regularly to make sure everything is working properly). Your files must be named “midterm_metrics_name1_name2.Rmd” if you are in a group, or “midterm_metrics_name1.Rmd” otherwise.
This is the .Rmd file for you to fill out. Just put your code within the marked regions, and click the Knit button in RStudio, just above this script, to see the generated HTML output. Knit it now, before you add any code, to see what the output looks like.
Where indicated, replace eval = FALSE by eval = TRUE in the chunk “header”. This will make sure that the chunk is ‘evaluated’ (i.e., that the code runs) when you knit the document.
# code goes here
This midterm project consists of one long analysis of data from the 2020 Presidential elections in the United States.
You will be asked to download the data from its original source, just like you would if you were to undertake this task by yourself. You will use everything we’ve learned so far: importing (un-tidy) data; summarizing, visualizing and tidying it; running regressions for different variables; and thinking about whether these associations can be interpreted causally. In the process, you will also learn two useful operations: merging two (or more) datasets together, and “reshaping” data from wide to long format. Don’t worry, we will walk you through how to do all of this.
We hope you find this project interesting and useful. Now let’s get going!
It’s good practice to load all the packages you will need in the same place.
(Once packages are installed, remember to set
eval = FALSE to eval = TRUE in the chunk
below.)
# Load all the packages you need here
# Don't write the install.packages() code here. Run that code in the console.
library(magrittr)
library(readxl)
library(readr)
library(tidyverse)
library(janitor)
library(sf)
library(tmap)
library(skimr)
library(jtools)
library(rmapshaper)
library(huxtable)
Download the data from November 2020 US presidential elections from here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ#
Scroll down, click on the download icon (Access File) next to the second file (countypres_2000-2020.tab), and select “Comma Separated Values (Original File Format)”.
This is straight from the MIT Election Data + Science lab, which collects and shares election data from the United States. This dataset contains election results aggregated at the county level. Counties are small geographic units in the United States.
1*. Load the file you downloaded
into RStudio in an object called elections. You can use the
read.csv function.
elections <- read.csv("countypres_2000-2020.csv")
1. Use tail to display the last 10 rows of the
dataset
tail(elections, 10) # tail() shows only 6 rows by default, so we ask for 10
2. What years are included in our data? What’s inside the enigmatic
“mode” variable? Tabulate the variables year and
mode separately to find out.
table(elections$year)
table(elections$mode)
Years: 2000, 2004, 2008, 2012, 2016, 2020.
Modes: ELECTION DAY, ABSENTEE, EARLY VOTING, EARLY VOTE, PROVISIONAL, FAILSAFE, ONE STOP, etc.
3. What is the unit of observation? In other words, what does each row correspond to?
Each row corresponds to the number of votes received by a candidate in a given county, election year, and voting mode.
4. How many variables are there?
ncol(elections)
12
5. How many rows are there?
nrow(elections)
72617
1*. This data contains the results
of all presidential elections from 2000 to 2020. We only want to analyse
the 2020 results. Use the filter function to keep only the
observations that concern the 2020 election. Create a new object
elections_2020 for this task.
library(dplyr)
elections_2020 <- elections %>%
filter(year == 2020)
2. Our main variable of interest is candidatevotes. In
how many counties did a candidate get 10 votes or fewer (less than or
equal to 10)? How many different candidates were concerned?
elections_2020 %>%
filter(candidatevotes <= 10) %>%
count()
4805
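The second part of the question (how many different candidates were concerned?) is not answered by the row count above. A possible sketch, using dplyr’s n_distinct:

```r
library(dplyr)

# Among the rows where a candidate got 10 votes or fewer,
# count how many distinct candidates appear
elections_2020 %>%
  filter(candidatevotes <= 10) %>%
  summarise(n_candidates = n_distinct(candidate))
```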
3. We saw previously that there were different modes of voting
recorded in the data, and that one category is “Total”. This could be
worrying: are we counting ballots twice? Could a ballot appear both in
the “in person” and “total” categories? Cross-tabulate the
mode variable with state, to understand
whether the modes of voting are the same in all States. (Use the
table function)
table(elections_2020$mode, elections_2020$state)
As you can see, in some States the mode variable takes
exclusively the value “TOTAL” (in Alabama for instance). In those
States, we know the total number of votes obtained by candidates in each
county. In other States, we do not have the total number of votes, but
only a number disaggregated by voting method (in Arizona for
instance).
All of those situations are fine, since they mean ballots are not counted twice.
However, there is one worrying case: Utah. In this State, we observe both the total number of votes and the number of votes by mail, on election day, and by early voting. By looking more closely at the data, you can see that in Salt Lake County (FIPS code 49035), we have the number of votes received early; on election day; and by mail. For all other counties, the ballot count for these voting modes is zero, and we only observe the “total” ballot count. If you wanted to be convinced more formally that this is true, you could run the following script:
elections_2020 %>%
filter(state=="UTAH") %>% #keep only UTAH observations
mutate(mode_vote_total = (mode=="TOTAL" & candidatevotes!=0), #create a dummy if we observe a ballot in the "total" category
mode_vote_other = (mode!="TOTAL" & candidatevotes!=0)) %>% # create a dummy if we observe a ballot in any other voting category
group_by(county_fips, county_name) %>% # group by county
summarise(mode_vote_total = max(mode_vote_total), #create a dummy if for any candidate in the county we observe a ballot in the "total" category
mode_vote_other = max(mode_vote_other)) %>% #create a dummy if for any candidate in the county we observe a ballot in any other voting category
View() #visualise the resulting dataset
#You can see that in no county do we observe a ballot both in "total" and in another category!
So it holds true that we never count a ballot twice: in every county, either we count it in the “total” category, or we count it in one of the categories associated with a specific voting method. This is good news!
1. In States like Arizona, where we don’t have the total number of ballots, it is not clear that the voting modes for which we have data cover all the voting methods available.
For instance, in Kentucky we only have the number of ballots cast on election day. A quick search online shows that early voting is possible in this State, although it is heavily restricted. It is hence very likely that some ballots are missing from our data.
Now that we noticed this feature of the data, we want to check how big of an issue this is: how many votes are not counted in our data? Can we still reasonably use it, or is its quality too bad?
To answer this question, we will compare the total number of votes by State in our dataset to the official number provided by the Federal Government.
Create a new object called elections_2020_state_totals,
with two variables: state_po and
sum_votes_state: the sum of candidatevotes by
State. Hint: you should group your data by state_po, and
then use summarise. You should also disregard missing
values when you compute the sum of candidatevotes: find out
the argument of the sum function which lets you do
that.
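The argument in question is na.rm. A quick illustration on a toy vector:

```r
votes <- c(120, NA, 75)
sum(votes)               # NA: a single missing value makes the whole sum missing
sum(votes, na.rm = TRUE) # 195: missing values are removed before summing
```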
elections_2020_state_totals <- elections_2020 %>%
group_by(state_po) %>%
summarise(sum_votes_state = sum(candidatevotes, na.rm = TRUE))
2. The official election data in the US is provided by the Federal Election Commission, in this document. Notice that it is in PDF format. We already extracted it into an Excel file for you.
Download the file here
Import it into a new object named official_state_votes,
using the read_excel function from the readxl
package.
library(readxl)
official_state_votes <- read_excel("state_votes.xlsx")
3. Both your elections_2020_state_totals and
official_state_votes should have 51 observations, one per
State in the US + Washington DC. But now the official number of votes
and the one from our dataset are in two different R objects. You could
compare the numbers manually, but this would be a lot of work. Instead,
we want to merge them into a single object.
We will do this using the left_join() function from the
tidyverse package.
This is how left_join() works:
left_join(x,y) will return all rows from x
(very important!) and all columns from x and
y. Rows in x with no match in y
will have NA values in the new columns. Rows in
y with no match in x will be discarded.
The following cheatsheet might help you visualise the way the function works: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf. You can also find more details here.
If nothing is specified, left_join will automatically
detect the variables which have the same name in both datasets and use
them to join the data. If your variables have different names in the
datasets, you can specify
left_join(x,y, by=c("var1x" = "var1y", "var2x" = "var2y")).
This will match variable var1x in x to
var1y in y and var2x in
x to var2y in y. (You can keep
going, and join on more than two variables!)
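To make the matching rule concrete, here is a minimal toy example with made-up data (the column names mirror the ones used below):

```r
library(dplyr)

x <- tibble(state_po    = c("AL", "AZ", "KY"), sum_votes = c(100, 200, 300))
y <- tibble(state_short = c("AL", "AZ", "TX"), official  = c(101, 199, 400))

left_join(x, y, by = c("state_po" = "state_short"))
# All 3 rows of x are kept; KY has no match, so it gets NA in `official`;
# TX appears only in y, so it is discarded.
```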
Create a new object called
elections_2020_state_comparison, which is the
left_join of elections_2020_state_totals (1st
argument) and official_state_votes (2nd argument). Here,
our identifying variables have different names, so we need to add the
argument by=c("state_po" = "state_short")
elections_2020_state_comparison <- left_join(elections_2020_state_totals, official_state_votes, by=c("state_po" = "state_short"))
Congratulations, you have merged your first datasets! Easy right? :)
4. Compute the percentage difference between the official number of
votes, and the sum of votes in our data by State:
(official votes - sum votes)/official votes*100. What are
the min, max and mean of this percentage difference?
elections_2020_state_comparison <- elections_2020_state_comparison %>%
mutate(percent_diff = (official_total_votes - sum_votes_state)/official_total_votes*100)
elections_2020_state_comparison %>%
summarise(
min_diff = min(percent_diff),
max_diff = max(percent_diff),
mean_diff = mean(percent_diff)
)
min: -0.925, max: 0.545, mean: -0.044
If our election data contained the official number of votes per State, the difference would always be zero. Clearly, this is not the case: in some States our data has too few votes, in others it has too many. However, the difference is always very very small and should not dramatically change the final results obtained by each candidate.
Still, the discrepancy is not easy to explain: the MIT lab collects official results from all the States and puts them together into a national file. It is not clear where the difference is coming from: maybe there was human error in data collection, maybe the results were updated in some States after a recount…
Since the United States government itself does not provide official county-level results for the entire country, the only alternative to using the MIT data would be to start from State level official results. However, this would be a lot of work as we would have to find all the files, deal with format issues (many data files are still provided in pdf, for instance in Kentucky) and clean them all. It would be time consuming, and we would likely make more errors than the MIT lab!
1. Now that we have decided that our data was good enough, we still
need to tidy it before we can use it. Make sure that the candidate name
is always written with the same format by tabulating the
candidate variable.
table(elections_2020$candidate)
2*. Remember that we noticed earlier
there were missing values in the candidatevotes and
totalvotes variables.
Run the following script to fix the issue (replace
eval = FALSE by eval = TRUE below, so that the
code is also run when you knit). This chunk of code:
- replaces candidatevotes by zero when it is missing
- replaces totalvotes by the sum of votes in the county when it is missing
Note that we create a new object, elections_2020_clean.
The ifelse function works in the following way:
ifelse(test, value if test==TRUE, value if test==FALSE).
For instance in the third line below, for each observation, if
candidatevotes is missing then it is replaced by 0,
otherwise it is not changed.
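ifelse works element by element on a whole vector, which is what makes it useful inside mutate. For instance:

```r
x <- c(12, NA, 7, NA)
ifelse(is.na(x), 0, x) # 12 0 7 0: each missing value is replaced by 0
```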
elections_2020_clean <- elections_2020 %>%
group_by(county_fips) %>%
mutate(candidatevotes = ifelse(is.na(candidatevotes), 0, candidatevotes),
totalvotes = ifelse(is.na(totalvotes), sum(candidatevotes), totalvotes))
3*. Clean the elections_2020_clean object some more by:
Summarising the data so that the unit of observation is a
candidate in a county. To do this, you need to sum the number of votes
obtained by each candidate in each county, across all modes of voting.
Note: in order to keep all the variables, you will need to use the
following command
group_by(state, state_po, county_name, county_fips, totalvotes, candidate).
If we only grouped by county_fips, candidate, we would
“lose” all the other “extra” variables that give us information about
the county (its name, its state, the number of votes, …). If this is not
clear to you, simply run the code twice, once grouping by all the
variables, and once with group_by(county_fips, candidate),
and see what happens!
Creating a variable pct_votes which is equal to the
percentage of votes obtained by a candidate in a county: (number of
votes obtained by a candidate / total number of votes in
county)*100.
Grouping the data by county (county_fips), and then
use a combination of rank() and desc()
commands to compute the ranks of each candidate within each county. You
should create a variable called candidate_rank which takes
the value 1 for the candidate who received the most votes in the county,
2 for the candidate who received the second highest number of votes
etc.
Once you’ve ensured your code is correct, “ungroup” the data by
passing your object into the ungroup() function or by
piping it. This function does what it says: it removes grouping. This
means that it cancels the effect of your last group_by, and
ensures that the next functions you use will not be applied by
group.
You can pipe all these commands together if you wish or proceed in several steps.
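To see how rank() and desc() combine, here is a small standalone example (desc comes from dplyr; it flips the sign so that the largest value gets rank 1):

```r
library(dplyr)

votes <- c(52000, 31000, 1200)
rank(desc(votes)) # 1 2 3: the candidate with the most votes is ranked first
```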
elections_2020_clean <- elections_2020_clean %>%
group_by(state, state_po, county_name, county_fips, totalvotes, candidate) %>%
summarise(votes = sum(candidatevotes), .groups = "drop") %>%
mutate(pct_votes = votes / totalvotes * 100) %>%
group_by(county_fips) %>%
mutate(candidate_rank = rank(desc(votes))) %>%
ungroup()
4. Check that your variable pct_votes is actually a
percentage (that it lies between 0 and 100). This is a safety
check to make sure that there is no inconsistency in the data.
elections_2020_clean %>%
filter(pct_votes < 0 | pct_votes > 100)
# no rows are returned => `pct_votes` is good
Election data are probably better visualized using maps.
In this section, we will produce a map displaying Joe Biden’s and Donald Trump’s vote percentages in each county.
To produce the map you will need to install and load the
sf, tmap and rmapshaper
packages.
Maps are tricky and very complicated objects. But to boil it down to the essential ingredients, you need a “shapefile”, which is basically a file that contains the map onto which data will be plotted.
This file is provided by the US census website here.
However it is somewhat heavy, so to make your life easier, we have
simplified it for you using the following commands. Do
NOT touch nor run this chunk, and do NOT
replace eval = FALSE by eval = TRUE. We’ve
left it so you can see how we created the simplified shapefile you will
be loading in the next question.
county_shfl_full <- read_sf("path/cb_2018_us_county_500k/cb_2018_us_county_500k.shp")
# where `path` corresponds to where the `cb_2018_us_county_500k` folder is located.
county_shfl <- st_simplify(county_shfl_full, preserveTopology = FALSE, dTolerance = 1000) #simplify the map to make it less memory-heavy
save(county_shfl, file="path/county_shfl.RData") #saving the object
You can download the resulting R object from this link: > https://www.dropbox.com/s/m8zq5e38imp2he1/county_shfl.RData?dl=1
1. Let’s load the county_shfl.RData file by using the
load() function. Replace eval = FALSE by
eval = TRUE.
You should now have an object called county_shfl in your
environment
2. FIPS codes are numbers which uniquely identify geographic areas in the US. Working with FIPS codes is always advisable when identifying areas, as names can be duplicated: there is an Adair County in Kentucky and another one in Iowa.
What is the county FIPS code of the “FLEMING” county in the
elections_2020_clean data.frame? What are the State FIPS
(STATEFP), county FIPS (COUNTYFP) and
GEOID of the same county in the county_shfl
data? Hint: in county_shfl the county’s name is written
“Fleming” and the variable containing county names is
NAME.
elections_2020_clean %>%
filter(county_name == "FLEMING") %>%
select(state, county_name, county_fips)
county_shfl %>%
filter(NAME == "Fleming") %>%
select(STATEFP, COUNTYFP, GEOID, NAME)
county FIPS code: 21069 State FIPS
(STATEFP): 21 county FIPS
(COUNTYFP): 069 GEOID:
21069
3. You might have noticed that the county_fips variable
in our elections_2020_clean is actually made of the State
FIPS code, followed by the county code.
In county_shfl, we have a variable for the State FIPS
code and a separate one for the county code. The variable containing the
same code as in elections_2020_clean is actually called
GEOID. This type of thing is exactly why you always need
to look inside your data: it would not have been obvious from the
variable names.
What is the type of county_fips in the
elections_2020_clean data? What is the type of
GEOID in county_shfl? Hint: use the
typeof function.
typeof(elections_2020_clean$county_fips)
typeof(county_shfl$GEOID)
integer, character
They are not the same! In order to merge the two datasets, we want
county_fips to have the same type as GEOID,
i.e., to be a character string instead.
Furthermore, because it is stored as a numeric value there is no leading zero (the FIPS code “01001” is stored as 1001).
The following script fixes this, and creates a new variable called
county_fips_clean. Run the code and change
eval = FALSE to eval = TRUE to make sure the
code runs when you knit your document.
We start by using the as.character function to convert
the variable to a character string. Then, we use the ifelse
function again. What this conditional statement says is: if the number
of characters of the variable county_fips_clean is equal to
4, then add a “0” in front of this variable; otherwise do nothing.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean=as.character(county_fips), #convert county_fips to a character string
county_fips_clean=ifelse(nchar(county_fips_clean)==4,
paste0("0", county_fips_clean),
county_fips_clean)) #Add a leading zero when necessary: if there are only four characters to the FIPS code, it means that the leading 0 is missing
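As an aside, an equivalent one-line approach, assuming all FIPS codes should be 5 characters long, is to zero-pad with sprintf:

```r
fips <- c(1001, 46113)
sprintf("%05d", fips) # "01001" "46113": each number is zero-padded to 5 digits
```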
4. How many different State FIPS codes are there in the
county_shfl dataset? Hint: You can use a combination of the
length and unique commands to count them.
length(unique(county_shfl$STATEFP))
56
5. Our shapefile clearly contains the map for more than just the 50 US States + DC. By looking at the existing State codes here, you can see that codes above 56 correspond to overseas territories.
The final map we want to produce should only contain the 50 States + DC: we want to remove overseas territories. We also exclude Alaska and Hawaii for purely aesthetic reasons: including them zooms out the map too much. We could fix this problem by showing Alaska and Hawaii in a corner of the map but this is beyond the scope of the midterm.
Filter county_shfl to keep only observations such
that:
- STATEFP is not equal to “02” (Alaska) or “15” (Hawaii)
- as.numeric(STATEFP) is lower than or equal to 56 (note that because State FIPS is stored as a character, we have to convert it to a numeric before evaluating whether it is below 56)
Create a new object called county_shfl_clean.
county_shfl_clean <- county_shfl %>%
filter((STATEFP != "02" & STATEFP != "15") & (as.numeric(STATEFP) <= 56))
Furthermore, some counties’ FIPS have changed over the years, and
they differ in our two datasets. The following script will fix all these
inconsistencies for you. The case_when function is similar
to ifelse. Here it modifies the variable
county_fips_clean, such that:
- if county_fips_clean was equal to “46113”, it now takes the value “46102”
- if state_po is equal to “DC”, county_fips_clean now takes the value “11001”
- otherwise, county_fips_clean keeps the same value.
Remember to replace eval = FALSE with eval = TRUE.
elections_2020_clean <- elections_2020_clean %>%
mutate(county_fips_clean = case_when(
county_fips_clean=="46113" ~ "46102",
state_po=="DC" ~ "11001",
TRUE ~ county_fips_clean))
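To illustrate how case_when evaluates its conditions in order (the first matching condition wins), here is a toy example on made-up codes:

```r
library(dplyr)

codes <- c("46113", "11001", "21069")
state <- c("SD",    "DC",    "KY")
case_when(
  codes == "46113" ~ "46102", # first condition checked
  state == "DC"    ~ "11001", # only reached if the first did not match
  TRUE             ~ codes    # fallback: keep the original value
)
# "46102" "11001" "21069"
```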
6. Use left_join to merge county_shfl_clean
(first argument) and elections_2020_clean (second
argument). Create a new object called elections_2020_map
for this question.
Here, our identifying variables have different names:
GEOID in county_shfl_clean (first argument)
and county_fips_clean in elections_2020_clean
(second argument). Look at the help file or go back to question 3 of the
“Checking quality” section to see how to specify the by argument.
elections_2020_map <- left_join(county_shfl_clean, elections_2020_clean, by=c("GEOID" = "county_fips_clean"))
7. Create a heat map of Biden and Trump’s percentage of the vote by
county, using the following code (change eval = FALSE to
eval = TRUE just below so that the html file displays the
map):
tmap_mode("view") # we want an interactive map
elections_2020_map %>%
# only keep Trump and Biden
filter(candidate %in% c("DONALD J TRUMP", "JOSEPH R BIDEN JR")) %>%
tm_shape() +
tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
tm_fill(
col = "pct_votes", # variable to be mapped
title = "% of votes", # legend title
id = "county_name", # information to display when the mouse hovers over a county
popup.vars = c("Vote %:" = "pct_votes")) + # variable to display in popup window
tm_facets(by = "candidate") # create one map per selected candidate